Poster: Albretch Date: Sep 29, 2014 12:59am
Forum: texts Subject: Heritrix: data only renderings and consolidation + remote to local address mappings ...

Does Heritrix have features to:

a) deal with the messy HTML you find online;

b) deal with JavaScript-generated pages and CSS sprites;

c) parse just the data out of pages, discarding all the nonsense rendering and goo such as ads (of course based on streaming, SAX-like algorithms, not DOM), with the option to notice:
c.1) when the actual data in a page hasn't changed (only the pictures, ads and such things in it);
c.2) when only new data (such as comments) has been appended to previous discussions;
c.3) changes in the metadata about pages, and whether they reflect actual changes in the data
(a rough sketch of what I mean by c.1 is further below);

d) of course, §c should be based on a per-site/URL strategy, which would determine the kind of scraping strategy to be used;

e) maintain (a kind of) content URIs and their various authorings, in order to track what are basically different renderings of the same text. Say you have already downloaded a page containing text by some author; then you would like to know whether that is the same data/text, and/or which changes have been made and where exactly;

f) do remote-to-local address mappings (via a *nix hosts-file-like strategy; also sketched below);

g) crawl the so-called "deep web" and somehow track how you got to those pages?

Thank you very much,
lbrtchx

~ $ date
Mon Sep 29 04:00:53 EDT 2014

and, of course, I assume Heritrix has all the other features found in the *nix utility wget (wget --help), and, as part of the transformations specified in §d, you should be able to call parametrized routines to parse the text or HTML renderings.

lbrtchx
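To make §c.1 more concrete, here is a rough sketch of what I mean (plain JDK Java, nothing Heritrix-specific; the class name DataOnlyFingerprint and the whole approach are just my illustration, not anything Heritrix provides): stream through the messy HTML with a callback-based parser, keep only the visible text, and hash it, so two fetches of the same page compare equal even when the ads and scripts around the text have changed.

import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/*
 * Streams through (possibly messy) HTML with the JDK's callback-based
 * parser, keeps only the visible text and hashes it, so that two fetches
 * of a page can be compared while ignoring ads, scripts and markup.
 */
public class DataOnlyFingerprint {

    /* Extracts visible text, skipping <script> and <style> content. */
    static String visibleText(String html) throws Exception {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            private boolean skip = false;          // inside <script>/<style>?

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE) skip = true;
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE) skip = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (!skip) text.append(data).append(' ');
            }
        };
        Reader r = new StringReader(html);
        new ParserDelegator().parse(r, cb, true);   // lenient, streaming pass
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    /* SHA-256 over the extracted text; equal hashes => data unchanged. */
    static String fingerprint(String html) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(visibleText(html).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String v1 = "<html><body><p>Same article text</p><script>ads()</script></body></html>";
        String v2 = "<html><body><p>Same article text</p><script>otherAds()</script></body></html>";
        System.out.println(fingerprint(v1).equals(fingerprint(v2))); // true
    }
}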
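And for §f, this is roughly the hosts-file-like mapping I have in mind (again just an illustration; HostsRemapper is a made-up name, not an existing Heritrix or wget feature): a plain text file with lines of the form "127.0.0.1 example.org", and any URL whose host appears in it gets rewritten to the mapped local address before fetching.

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/*
 * Minimal /etc/hosts-style remapping: lines of "address hostname", e.g.
 * "127.0.0.1 example.org". URLs whose host appears in the file are
 * rewritten to point at the mapped (local) address.
 */
public class HostsRemapper {
    private final Map<String, String> map = new HashMap<>();

    public HostsRemapper(String hostsFile) throws Exception {
        for (String line : Files.readAllLines(Paths.get(hostsFile))) {
            line = line.replaceAll("#.*", "").trim();     // strip comments
            if (line.isEmpty()) continue;
            String[] parts = line.split("\\s+");
            for (int i = 1; i < parts.length; i++) map.put(parts[i], parts[0]);
        }
    }

    /* Returns the URI rewritten to the local address, or the original one. */
    public URI remap(URI remote) throws Exception {
        String local = map.get(remote.getHost());
        if (local == null) return remote;
        return new URI(remote.getScheme(), remote.getUserInfo(), local,
                       remote.getPort(), remote.getPath(),
                       remote.getQuery(), remote.getFragment());
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("hosts", ".txt");
        Files.write(tmp, Arrays.asList("127.0.0.1 example.org  # local mirror"));
        HostsRemapper r = new HostsRemapper(tmp.toString());
        System.out.println(r.remap(new URI("http://example.org/some/page.html")));
        // -> http://127.0.0.1/some/page.html
    }
}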
This post was modified by Albretch on 2014-09-29 07:59:59